Central North Sea
CodeBC: A More Secure Large Language Model for Smart Contract Code Generation in Blockchain
Wang, Lingxiang, Zhang, Hainan, Zhang, Qinnan, Wang, Ziwei, Zheng, Hongwei, Dong, Jin, Zheng, Zhiming
Large language models (LLMs) excel at generating code from natural language instructions, yet they often lack an understanding of security vulnerabilities. This limitation makes it difficult for LLMs to avoid security risks in generated code, particularly in high-security programming tasks such as smart contract development for blockchain. Researchers have attempted to enhance the vulnerability awareness of these models by training them to differentiate between vulnerable and fixed code snippets. However, this approach relies heavily on manually labeled vulnerability data, which is only available for popular languages like Python and C++. For low-resource languages like Solidity, used in smart contracts, large-scale annotated datasets are scarce and difficult to obtain. To address this challenge, we introduce CodeBC, a code generation model specifically designed for generating secure smart contracts in blockchain. CodeBC employs a three-stage fine-tuning approach based on CodeLlama, distinguishing itself from previous methods by not relying on pairwise vulnerability location annotations. Instead, it leverages vulnerability and security tags to teach the model the differences between vulnerable and secure code. During the inference phase, the model leverages security tags to generate secure and robust code. Experimental results demonstrate that CodeBC outperforms baseline models in terms of BLEU, CodeBLEU, and compilation pass rates, while significantly reducing vulnerability rates. These findings validate the effectiveness and cost-efficiency of our three-stage fine-tuning strategy, making CodeBC a promising solution for generating secure smart contract code.
DataDRILL: Formation Pressure Prediction and Kick Detection for Drilling Rigs
Arifeen, Murshedul, Petrovski, Andrei, Hasan, Md Junayed, Kotenko, Igor, Sletov, Maksim, Hassard, Phil
Accurate real-time prediction of formation pressure and kick detection is crucial for drilling operations, as it can significantly improve decision-making and the cost-effectiveness of the process. Data-driven models have gained popularity for automating drilling operations by predicting formation pressure and detecting kicks. However, the current literature does not make supporting datasets publicly available to advance research in the field of drilling rigs, thus impeding technological progress in this domain. This paper introduces two new datasets to support researchers in developing intelligent algorithms to enhance oil/gas well drilling research. The datasets include data samples for formation pressure prediction and kick detection with 28 drilling variables and more than 2000 data samples. Principal component regression is employed to forecast formation pressure, while principal component analysis is utilized to identify kicks for the dataset's technical validation. Notably, the R2 and Residual Predictive Deviation scores for principal component regression are 0.78 and 0.922, respectively.
Mining Java Memory Errors using Subjective Interesting Subgroups with Hierarchical Targets
Remil, Youcef, Bendimerad, Anes, Chambard, Mathieu, Mathonat, Romain, Plantevit, Marc, Kaytoue, Mehdi
Software applications, especially Enterprise Resource Planning (ERP) systems, are crucial to the day-to-day operations of many industries. Therefore, it is essential to maintain these systems effectively using tools that can identify, diagnose, and mitigate their incidents. One promising data-driven approach is the Subgroup Discovery (SD) technique, a data mining method that can automatically mine incident datasets and extract discriminant patterns to identify the root causes of issues. However, current SD solutions have limitations in handling complex target concepts with multiple attributes organized hierarchically. To illustrate this scenario, we examine the case of Java out-of-memory incidents among several possible applications. We have a dataset that describes these incidents, including their context and the types of Java objects occupying memory when it reaches saturation, with these types arranged hierarchically. This scenario inspires us to propose a novel Subgroup Discovery approach that can handle complex target concepts with hierarchies. To achieve this, we design a pattern syntax and a quality measure that ensure the identified subgroups are relevant, non-redundant, and resilient to noise. To achieve the desired quality measure, we use the Subjective Interestingness model that incorporates prior knowledge about the data and promotes patterns that are both informative and surprising relative to that knowledge. We apply this framework to investigate out-of-memory errors and demonstrate its usefulness in incident diagnosis. To validate the effectiveness of our approach and the quality of the identified patterns, we present an empirical study. The source code and data used in the evaluation are publicly accessible, ensuring transparency and reproducibility.
Policy Synthesis and Reinforcement Learning for Discounted LTL
Alur, Rajeev, Bastani, Osbert, Jothimurugan, Kishor, Perez, Mateo, Somenzi, Fabio, Trivedi, Ashutosh
The difficulty of manually specifying reward functions has led to an interest in using linear temporal logic (LTL) to express objectives for reinforcement learning (RL). However, LTL has the downside that it is sensitive to small perturbations in the transition probabilities, which prevents probably approximately correct (PAC) learning without additional assumptions. Time discounting provides a way of removing this sensitivity, while retaining the high expressivity of the logic. We study the use of discounted LTL for policy synthesis in Markov decision processes with unknown transition probabilities, and show how to reduce discounted LTL to discounted-sum reward via a reward machine when all discount factors are identical.
Locksynth: Deriving Synchronization Code for Concurrent Data Structures with ASP
Varanasi, Sarat Chandra, Mittal, Neeraj, Gupta, Gopal
We present Locksynth, a tool that automatically derives synchronization needed for destructive updates to concurrent data structures that involve a constant number of shared heap memory write operations. Locksynth serves as the implementation of our prior work on deriving abstract synchronization code. Designing concurrent data structures involves inferring correct synchronization code starting with a prior understanding of the sequential data structure's operations. Further, an understanding of shared memory model and the synchronization primitives is also required. The reasoning involved transforming a sequential data structure into its concurrent version can be performed using Answer Set Programming and we mechanized our approach in previous work. The reasoning involves deduction and abduction that can be succinctly modeled in ASP. We assume that the abstract sequential code of the data structure's operations is provided, alongside axioms that describe concurrent behavior. This information is used to automatically derive concurrent code for that data structure, such as dictionary operations for linked lists and binary search trees that involve a constant number of destructive update operations. We also are able to infer the correct set of locks (but not code synthesis) for external height-balanced binary search trees that involve left/right tree rotations. Locksynth performs the analyses required to infer correct sets of locks and as a final step, also derives the C++ synchronization code for the synthesized data structures. We also provide a performance analysis of the C++ code synthesized by Locksynth with the hand-crafted versions available from the Synchrobench microbenchmark suite. To the best of our knowledge, our tool is the first to employ ASP as a backend reasoner to perform concurrent data structure synthesis.
Computer Vision - Richard Szeliski
As humans, we perceive the three-dimensional structure of the world around us with apparent ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers sitting on the table next to you. You can tell the shape and translucency of each petal through the subtle patterns of light and shading that play across its surface and effortlessly segment each flower from the background of the scene (Figure 1.1). Looking at a framed group por- trait, you can easily count (and name) all of the people in the picture and even guess at their emotions from their facial appearance. Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions1 to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains elusive (Marr 1982; Palmer 1999; Livingstone 2008).
Fixed-Point Code Synthesis For Neural Networks
Benmaghnia, Hanane, Martel, Matthieu, Seladji, Yassamine
Over the last few years, neural networks have started penetrating safety critical systems to take decisions in robots, rockets, autonomous driving car, etc. A problem is that these critical systems often have limited computing resources. Often, they use the fixed-point arithmetic for its many advantages (rapidity, compatibility with small memory devices.) In this article, a new technique is introduced to tune the formats (precision) of already trained neural networks using fixed-point arithmetic, which can be implemented using integer operations only. The new optimized neural network computes the output with fixed-point numbers without modifying the accuracy up to a threshold fixed by the user. A fixed-point code is synthesized for the new optimized neural network ensuring the respect of the threshold for any input vector belonging the range [xmin, xmax] determined during the analysis. From a technical point of view, we do a preliminary analysis of our floating neural network to determine the worst cases, then we generate a system of linear constraints among integer variables that we can solve by linear programming. The solution of this system is the new fixed-point format of each neuron. The experimental results obtained show the efficiency of our method which can ensure that the new fixed-point neural network has the same behavior as the initial floating-point neural network.
Estimating g-Leakage via Machine Learning
Romanelli, Marco, Chatzikokolakis, Konstantinos, Palamidessi, Catuscia, Piantanida, Pablo
This paper considers the problem of estimating the information leakage of a system in the black-box scenario. It is assumed that the system's internals are unknown to the learner, or anyway too complicated to analyze, and the only available information are pairs of input-output data samples, possibly obtained by submitting queries to the system or provided by a third party. Previous research has mainly focused on counting the frequencies to estimate the input-output conditional probabilities (referred to as frequentist approach), however this method is not accurate when the domain of possible outputs is large. To overcome this difficulty, the estimation of the Bayes error of the ideal classifier was recently investigated using Machine Learning (ML) models and it has been shown to be more accurate thanks to the ability of those models to learn the input-output correspondence. However, the Bayes vulnerability is only suitable to describe one-try attacks. A more general and flexible measure of leakage is the g-vulnerability, which encompasses several different types of adversaries, with different goals and capabilities. In this paper, we propose a novel approach to perform black-box estimation of the g-vulnerability using ML. A feature of our approach is that it does not require to estimate the conditional probabilities, and that it is suitable for a large class of ML algorithms. First, we formally show the learnability for all data distributions. Then, we evaluate the performance via various experiments using k-Nearest Neighbors and Neural Networks. Our results outperform the frequentist approach when the observables domain is large.
SAR: Learning Cross-Language API Mappings with Little Knowledge
Bui, Nghi D. Q., Yu, Yijun, Jiang, Lingxiao
To save manual effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, all these approaches still require large amount of manual effort in preparing parallel program corpora, ranging from pairs of APIs, to manually identified code in different languages that are considered as functionally equivalent. To minimize the manual effort in identifying parallel program corpora and API mappings, this paper aims at an automated approach to map APIs across languages with much less knowledge a priori needed than other existing approaches. The approach is based on an realization of the notion of domain adaption combined with code embedding, which can better align two vector spaces: taking as input large sets of programs, our approach first generates numeric vector representations of the programs, especially the APIs used in each language, and it adapts generative adversarial networks (GAN) to align the vectors from the spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from optional API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross-language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces.